Research on Efficient Resource Utilization in Data Intensive Distributed Systems
Not recorded
Abstract
In recent years there has been unprecedented growth in the amount of data being gathered worldwide. Data may include business data, sensor data, web data, log data, and social data. Data volumes of terabytes are common, petabytes are no longer unusual, and it is almost certainly not long before exabyte-scale data becomes the norm. Big data presents an opportunity for in-depth analysis. Trends and patterns can be derived from the data, enabling a wide variety of business and research opportunities. As a result, the ability to process large amounts of data (often too large to be stored and processed by traditional relational database systems) has become very important. Processing big data requires very large-scale resources, such as large clusters of commodity machines that allow the data to be processed in a distributed fashion. Google’s MapReduce framework and its open-source implementation, Hadoop, have emerged as the de facto standard data processing platform for such environments. At the same time cloud computing has emerged, enabling users to provision computing resources dynamically and pay only for what they use. Cloud computing enables anyone to gain access to large-scale computing resources without the need to create their own infrastructure, drastically driving down the cost and making it possible for even individuals or small organizations to afford the resources necessary for processing big data. Cloud computing adoption has been driven by advances in virtualization and the widespread availability of high-bandwidth network connectivity. At this scale, it becomes increasingly difficult to utilize all resources efficiently.
With the billing model used by most cloud providers, provisioning a single node for ten hours costs the same as provisioning ten nodes for one hour, so it would be highly desirable if distributed applications could achieve a linear speed-up when more nodes or nodes with higher hardware specifications are provisioned. However, a number of factors limit the scalability of these applications so they cannot fully use the provisioned resources. In the cloud, inefficient resource utilization directly leads to higher costs for no additional benefit.
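The billing argument above can be made concrete with a small calculation. The sketch below is illustrative only: it assumes exact per-node-hour billing with no rounding, a hypothetical price `RATE`, and it uses Amdahl's law as a stand-in for "factors that limit scalability". The abstract does not name a specific scalability model, so `amdahl_speedup`, `job_cost`, and `parallel_fraction` are assumptions introduced here.

```python
def amdahl_speedup(nodes: int, parallel_fraction: float) -> float:
    """Speed-up on `nodes` nodes when only `parallel_fraction` of the work
    scales with node count (Amdahl's law, used here as an assumed model)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nodes)


def job_cost(nodes: int, single_node_hours: float, rate_per_node_hour: float,
             parallel_fraction: float = 1.0) -> float:
    """Cost of one job under per-node-hour billing.

    Assumes fractional hours are billed exactly; real providers may round.
    """
    runtime_hours = single_node_hours / amdahl_speedup(nodes, parallel_fraction)
    return nodes * runtime_hours * rate_per_node_hour


RATE = 0.10  # hypothetical price per node-hour

# Linear speed-up: one node for ten hours costs the same as ten nodes for one hour.
print(job_cost(1, 10.0, RATE))
print(job_cost(10, 10.0, RATE))

# If only 90% of the work scales, ten nodes finish in about 1.9 hours,
# so the same job costs roughly 1.9 times as much.
print(job_cost(10, 10.0, RATE, parallel_fraction=0.9))
```

Under these assumptions, any speed-up below linear shows up directly as extra cost: the bill grows with node count while the runtime shrinks more slowly.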
Similar resources
A New Job Scheduling in Data Grid Environment Based on Data and Computational Resource Availability
Data Grid is an infrastructure that controls huge amounts of data files and provides intensive computational resources across geographically distributed collaborations. The heterogeneity and geographic dispersion of grid resources and applications pose complex problems such as job scheduling. Most existing scheduling algorithms in Grids only focus on one kind of Grid jobs which can be data...
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
The Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and that the same workload is assigned to each node. Default Hadoop d...
A new Shuffled Genetic-based Task Scheduling Algorithm in Heterogeneous Distributed Systems
Distributed systems such as Grid and Cloud Computing provision web services to their users all over the world. One of the most important concerns that service providers encounter is handling total cost of ownership (TCO). A large part of TCO is related to power consumption due to inefficient resource management. The task scheduling module, as a key component, can have a drastic impact on both user...
Disaggregation in the Cloud with μInstances and Cirrus
Resource disaggregation can provide significant improvements in the utilization of resources in the datacenter. A Google cluster trace analysis confirms that up to 70% of memory may be recovered with resource disaggregation. However, resource disaggregation in the cloud is currently unfeasible due to the hardware and network changes required by previously proposed designs. We make the observati...
Efficient Resource Utilization in Hadoop on Virtual Machine
Hadoop is an open-source software technology used for processing large amounts of data across clusters of commodity servers in a distributed manner. It is mainly designed to provide high fault tolerance and to scale up from a single server to thousands of machines. Hadoop uses the Hadoop distributed file system (HDFS), which is an open-source implementation of the Google File System (GFS), for data...
E2DR: Energy Efficient Data Replication in Data Grid
Data grids are an important branch of grid computing which provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client applications in scientific and business domai...